INTERSPEECH 2004 - Speech Processing

Total: 95

#1 Scalable distributed speech recognition using multi-frame GMM-based block quantization

Authors: Kuldip K. Paliwal ; Stephen So

In this paper, we propose the use of the multi-frame Gaussian mixture model-based block quantizer for the coding of Mel frequency-warped cepstral coefficient (MFCC) features in distributed speech recognition (DSR) applications. This coding scheme exploits intraframe correlation via the Karhunen-Loeve transform (KLT) and interframe correlation via the joint processing of adjacent frames, together with the computational simplicity of scalar quantization. The proposed coder is bit-rate scalable, which means that the bit-rate can be adjusted without re-training the quantizers. Static parameters such as the probability density function (PDF) model and KLT orthogonal matrices are stored at the encoder and decoder, and bit allocations are calculated 'on the fly' without intensive processing. This coding scheme is evaluated on the Aurora-2 database in a DSR framework. It is shown to achieve high recognition performance at low bit-rates, with a word error rate (WER) of 2.5% at 800 bps, which is less than 1% degradation from the baseline word recognition accuracy, and to degrade gracefully down to a WER of 7% at 300 bps.
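
A minimal sketch of how such 'on-the-fly' bit allocation can work, assuming the classical high-resolution allocation rule over stored KLT-domain variances; the function name, the clipping loop and the example figures are illustrative, not the authors' exact procedure:

```python
import numpy as np

def allocate_bits(variances, total_bits):
    """High-resolution bit allocation across KLT coefficients:
    b_i = b_avg + 0.5 * log2(var_i / geometric_mean(var)),
    with negative allocations clipped to zero and the budget
    redistributed among the remaining coefficients."""
    variances = np.asarray(variances, dtype=float)
    bits = np.zeros(len(variances))
    active = np.ones(len(variances), dtype=bool)
    while True:
        b_avg = total_bits / active.sum()
        log_var = np.log2(variances[active])
        bits[active] = b_avg + 0.5 * (log_var - log_var.mean())
        negative = active & (bits < 0)
        if not negative.any():
            return bits
        bits[negative] = 0.0
        active &= ~negative

# Example: distribute 48 bits per block over 10 KLT coefficients.
rng = np.random.default_rng(0)
variances = np.sort(rng.uniform(0.1, 10.0, 10))[::-1]
print(allocate_bits(variances, 48).round(2))
```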

#2 Robust speech recognition over packet networks: an overview

Authors: Naveen Srinivasamurthy ; Kyu Jeong Han ; Shrikanth Narayanan

Conventional circuit-switched networks are increasingly being replaced by packet-based networks for voice communication applications. Additionally, there has been an increased deployment of services supporting speech-based interactions. These trends demand reliable transmission of speech data, not just for playback but also to ensure acceptable automatic speech recognition (ASR) performance. In this paper, we present an overview of techniques that have been investigated to improve ASR performance against two major degradation factors in the context of packet networks: (1) information loss due to a low bit-rate codec and (2) packet loss due to channel (network) conditions. In addition, we highlight another key issue, packet loss rate, by showing ASR performance as a function of packet size and channel condition.

#3 Theory for speaker recognition over IP

Authors: Thomas Eriksson ; Samuel Kim ; Hong-Goo Kang ; Chungyong Lee

In this paper, we develop a theory of speaker recognition based on information theory. We show that the performance of a speaker recognition system is closely connected to the mutual information between the features and the speaker identity, and derive upper and lower bounds on the performance. We apply the theory to the case where the speech is coded and transmitted over a packet-based channel in which packet losses occur. The theory gives important insights into which methods can improve recognition performance and which methods are ineffective.
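
The connection between mutual information and recognition error can be made concrete with Fano's inequality, which bounds how well any classifier can do. The sketch below is a generic illustration of such a lower bound, not the bounds derived in the paper:

```python
import numpy as np

def fano_error_lower_bound(mutual_info_bits, num_speakers):
    """Smallest error probability consistent with Fano's inequality,
    H(S|X) <= H_b(Pe) + Pe * log2(M - 1),
    where H(S|X) = log2(M) - I(X;S) under a uniform speaker prior."""
    M = num_speakers
    h_cond = np.log2(M) - mutual_info_bits
    if h_cond <= 0:
        return 0.0
    def rhs(pe):
        h_b = -pe * np.log2(pe) - (1 - pe) * np.log2(1 - pe)
        return h_b + pe * np.log2(M - 1)
    # rhs is increasing on (0, 1 - 1/M]; bisect for rhs(pe) = h_cond.
    lo, hi = 1e-12, 1.0 - 1.0 / M
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if rhs(mid) < h_cond else (lo, mid)
    return 0.5 * (lo + hi)

# With 100 speakers, more feature/speaker mutual information lowers the error floor.
for i_bits in (2.0, 4.0, 6.0):
    print(i_bits, round(fano_error_lower_bound(i_bits, 100), 3))
```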

#4 Voice portal services in packet network and VoIP environment

Authors: Wu Chou ; Feng Liu

In this paper, we study voice portal services in packet-network and VoIP environments. An extensible VoIPTeleserver for VoIP in a SIP (Session Initiation Protocol) environment is described. It is based on the concept of dialogue system and web convergence, which separates the channel-dependent media resources from the application creation environment. It supports XML-based service applications for multiple channels, including voice, DTMF, IM and chat over IP. Special attention is given to the adverse effects of delay, jitter and packet loss on voice portal services over IP. In particular, case studies of DTMF service in the voice portal under adverse channel conditions are performed, revealing the compounding effects of multiple channel impairments on DTMF. The potentially high error rate indicates that the data redundancy method proposed in RFC 2198 is needed for DTMF in order to achieve reliable voice portal services over IP.

#5 Synchronization of speaker selection for centralized tandem-free VoIP conferencing

Authors: Peter Kabal ; Colm Elliott

Traditional teleconferencing uses a select-and-mix function at a centralized conferencing bridge. In VoIP environments, this mixing operation can lead to speech degradation when using high compression speech codecs due to tandem encodings and coding of multi-talker signals. A tandem-free architecture can eliminate tandem encodings and preserve speech quality. VoIP conference bridges must also consider the variable network delays experienced by different packetized voice streams. A synchronized speaker selection algorithm at the bridge can smooth out network delay variations and synchronize incoming voice streams. This provides a clean mapping of the N input packet streams to the M output streams representing selected speakers. This paper presents a synchronized speaker selection algorithm and evaluates its performance using a conference simulator. The synchronization process is shown to account for only a small part of the overall delay experienced by selected packets.

#6 Measuring the perceived importance of time- and frequency-divided speech blocks for transmission over packet networks

Authors: Akitoshi Kataoka ; Yusuke Hiwasaki ; Toru Morinaga ; Jotaro Ikedo

This paper presents a way to calculate the perceived importance of speech segments as a single-value criterion, using a linear regression model. Unlike the commonly used voice activity detection (VAD) algorithms, this method allows us to obtain a finer priority granularity of speech segments. This can be used in conjunction with frequency-scalable speech coding techniques and IP QoS techniques to achieve efficient and quality-controlled voice transmission. A simple linear regression model is used to calculate the estimated mean opinion score (MOS) for the various cases of missing speech segments.
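
A minimal sketch of the estimated-MOS idea, assuming a plain least-squares fit; the features (fractions of lost voiced, unvoiced and high-band blocks) and the MOS targets are invented for illustration:

```python
import numpy as np

# Hypothetical training data: each row describes which time/frequency
# blocks were dropped in one test condition (invented features), and y
# is the MOS measured for that condition in a listening test.
X = np.array([[0.00, 0.00, 0.00],
              [0.05, 0.10, 0.20],
              [0.10, 0.20, 0.40],
              [0.20, 0.30, 0.60],
              [0.30, 0.50, 0.80]])
y = np.array([4.2, 3.9, 3.4, 2.8, 2.1])

# Ordinary least squares with an intercept term.
A = np.hstack([np.ones((len(X), 1)), X])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

def estimated_mos(x):
    """Predict the MOS for a new pattern of missing segments."""
    return float(w[0] + w[1:] @ np.asarray(x))

print(round(estimated_mos([0.15, 0.25, 0.50]), 2))
```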

#7 Comparison of transmitter-based packet-loss recovery techniques for voice transmission

Authors: Moo Young Kim ; W. Bastiaan Kleijn

To facilitate real-time voice communication through the Internet, forward error correction (FEC) and multiple description coding (MDC) can be used as low-delay packet-loss recovery techniques. We use both a Gilbert channel model and data obtained from real IP connections to compare the rate-distortion performance of different variants of FEC and MDC. Using identical overall rates with stringent delay constraints, we find that side-distortion optimized MDC generally performs better than Reed-Solomon based FEC. If the channel condition is known from feedback through the Real-Time Control Protocol (RTCP), then channel-optimized MDC can be used to exploit this information, resulting in significantly improved performance.
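
For reference, the Gilbert channel model mentioned above is a two-state Markov chain; a minimal simulation sketch (the parameter names are ours) looks like this:

```python
import random

def gilbert_loss_pattern(n_packets, p_gb, p_bg, seed=0):
    """Two-state Gilbert model: p_gb is the probability of moving from
    the good (received) state to the bad (lost) state, p_bg of recovering.
    Stationary loss rate is p_gb / (p_gb + p_bg); mean burst length is 1 / p_bg."""
    rng = random.Random(seed)
    lost, pattern = False, []
    for _ in range(n_packets):
        pattern.append(lost)
        lost = rng.random() < ((1.0 - p_bg) if lost else p_gb)
    return pattern

# 10% average loss in bursts of mean length 2.5 packets.
pattern = gilbert_loss_pattern(100_000, p_gb=1.0 / 22.5, p_bg=0.4)
print(sum(pattern) / len(pattern))  # close to 0.10
```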

#8 A first experience on multilingual acoustic modeling of the languages spoken in Morocco

Authors: José B. Marino ; Asuncion Moreno ; Albino Nogueiras

The goal of this paper is to explore and describe the potential of multilingual acoustic models for automatic speech recognition of the languages spoken in Morocco. The basic experimental framework comes from the OrienTel project, mainly the sound inventory of the Arabic languages and the speech databases. Monolingual and multilingual automatic speech recognition systems for Modern Colloquial and Standard Arabic (MCA and MSA, respectively) and French are developed and evaluated, in order to assess the phonetic exchange and similarity among the three languages. As a main result, it can be stated that a combined modeling of MSA and MCA, or even a trilingual design, does not harm the performance of the recognition system.

#9 Data-driven multidialectal phone set for Spanish dialects

Authors: Monica Caballero ; Asuncion Moreno ; Albino Nogueiras

This paper addresses the use of a data-driven approach to determine a multidialectal phone set for an automatic speech recognition system for Spanish dialects. The approach is based on a decision-tree clustering algorithm that clusters contextual units across dialects. This procedure avoids defining a global phonetic inventory and a prior study of sound similarity. The procedure is applied to Spanish as spoken in Spain, Colombia and Venezuela. Results show differences between phonemes that share the same SAMPA symbol in different dialects, and also reveal similarities between phonemes that are represented by different symbols in dialectal variants. Recognition results using this multidialectal approach outperform the monodialectal ones.

#10 Multilingual e-mail text processing for speech synthesis

Authors: Daniela Oria ; Akos Vetek

An integrated method of text pre-processing and language identification is introduced to deal with the problem of mixed-language e-mail messages in a speech-enabled e-mail reading system. Our method can confidently distinguish between the supported languages and switch between several TTS engines or languages to read the portions of the text in the appropriate language. This is achieved by making use of the combined information from a text pre-processor and a language identifier that relies on both statistical information and linguistic features indicative of a particular language.
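
Statistical language identification of the kind described is often built on character n-gram statistics; the sketch below is a toy illustration under that assumption (the training strings and smoothing are invented), not the system used in the paper:

```python
import math
from collections import Counter

def trigram_model(text):
    """Character-trigram counts for one language."""
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return grams, sum(grams.values())

def log_score(text, model, alpha=1.0, vocab=10_000):
    """Add-alpha smoothed log-likelihood of the text under the model."""
    grams, total = model
    return sum(math.log((grams[text[i:i + 3]] + alpha) / (total + alpha * vocab))
               for i in range(len(text) - 2))

# Toy training text; a real identifier is trained on large per-language corpora.
models = {
    "en": trigram_model("the meeting is scheduled for tomorrow morning "
                        "please confirm your attendance"),
    "fi": trigram_model("kokous on huomenna aamulla ole hyvä ja vahvista "
                        "osallistumisesi"),
}

sentence = "please confirm the schedule"
print(max(models, key=lambda lang: log_score(sentence, models[lang])))  # en
```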

#11 Multi-context rules for phonological processing in polyglot TTS synthesis

Authors: Harald Romsdorfer ; Beat Pfister

Polyglot text-to-speech synthesis, i.e. the synthesis of sentences containing one or more inclusions from other languages, primarily depends on an accurate morpho-syntactic analyzer for such mixed-lingual texts. From the output of this analyzer, the pronunciation can be derived by means of phonological transformations which are language-specific and depend on various contexts. In this paper a new rule formalism for such phonological transformations is presented, one that also meets the requirements of the mixed-lingual situation.

#12 A general approach to TTS reading of mixed-language texts

Authors: Leonardo Badino ; Claudia Barolo ; Silvia Quazza

The paper presents the Loquendo TTS approach to mixed-language speech synthesis, offering a range of options for the various situations where texts may occur in different languages or embed foreign phrases. The most challenging target is to make a monolingual TTS voice read a foreign-language text. The Foreign Pronunciation Strategy discussed here allows mixing phonetic transcriptions of different languages, relying on a Phoneme Mapping algorithm that makes foreign phoneme sequences pronounceable by monolingual voices. The algorithm extends previous solutions, obtaining a plausible approximate pronunciation. The method is efficient, language-independent and entirely phonetics-based, and it enables any Loquendo TTS voice to speak all the languages provided by the system.

#13 Context-dependent statistical augmentation of Persian transcripts

Authors: Panayiotis G. Georgiou ; Shrikanth S. Narayanan ; Hooman Shirani Mehr

The Persian language is transcribed in a lossy manner, as the script does not, as a rule, encode vowel information. This renders the written script suboptimal for language models in speech applications or for statistical machine translation. It also makes text-to-speech synthesis from Persian script input a one-to-many operation. In our previous work, we introduced an augmented transcription scheme that eliminates the ambiguity present in the Arabic script. In this paper, we propose a method for generating the augmented transcription from the Arabic script by statistically decoding through all possibilities and choosing the maximum likelihood solution. We demonstrate that even with a small amount of initial bootstrap data, we can achieve a decoding precision of about 93% with no human intervention. The precision can be as high as 99.2% in a semi-automated mode where low-confidence decisions are marked for human processing.
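
A minimal sketch of maximum-likelihood decoding over candidate augmented forms, assuming a bigram language model and Viterbi search; the candidate lexicon and all probabilities are invented for illustration:

```python
import math
from collections import defaultdict

# Hypothetical lexicon: each unvowelized token and its candidate augmented forms.
candidates = {
    "ktb": ["ketab", "kotob", "katab"],
    "mn":  ["man", "men"],
}

# Bigram log-probabilities, as if estimated from augmented bootstrap data
# (values invented; unseen pairs get a small floor probability).
bigram_lp = defaultdict(lambda: math.log(1e-4))
bigram_lp.update({("<s>", "ketab"): math.log(0.4),
                  ("<s>", "kotob"): math.log(0.1),
                  ("ketab", "men"): math.log(0.5),
                  ("kotob", "man"): math.log(0.3)})

def decode(tokens):
    """Viterbi over candidate vowelizations, maximizing bigram likelihood."""
    best = {"<s>": (0.0, ["<s>"])}
    for tok in tokens:
        nxt = {}
        for form in candidates[tok]:
            score, path = max((s + bigram_lp[(prev, form)], p)
                              for prev, (s, p) in best.items())
            nxt[form] = (score, path + [form])
        best = nxt
    _, path = max(best.values())
    return path[1:]

print(decode(["ktb", "mn"]))  # ['ketab', 'men']
```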

#14 A soft-decision MMSE amplitude estimator as a noise preprocessor for speech coders using a glottal sensor

Authors: Cenk Demiroglu ; David V. Anderson

A speech enhancement algorithm based on a soft-decision Ephraim-Malah suppression rule is proposed for intelligibility enhancement in parametric speech coders. A glottal sensor is used to improve the intelligibility of a baseline system that uses only the acoustic microphone. Objective measure tests show that the proposed system decreases the spectral distortion by 2-3 dB for most phonetic classes. Moreover, significant improvements in DRT scores for the nasality and sibilation features are obtained compared to the baseline system when the noise suppression systems are concatenated with a MELP-based speech coder.
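
For readers unfamiliar with the Ephraim-Malah rule, the sketch below computes the standard MMSE short-time spectral amplitude gain per frequency bin; it omits the soft-decision weighting and the glottal-sensor processing described in the paper:

```python
import numpy as np
from scipy.special import i0e, i1e

def mmse_stsa_gain(xi, gamma):
    """Ephraim-Malah MMSE short-time spectral amplitude gain per bin.

    xi:    a priori SNR, gamma: a posteriori SNR.
    Uses exponentially scaled Bessel functions (i0e(x) = exp(-x) * I0(x)),
    which absorb the exp(-v/2) factor of the textbook formula.
    """
    v = xi * gamma / (1.0 + xi)
    bessel = (1.0 + v) * i0e(v / 2.0) + v * i1e(v / 2.0)
    return (np.sqrt(np.pi) / 2.0) * (np.sqrt(v) / gamma) * bessel

# Gain rises toward 1 as the estimated SNR improves.
xi = np.array([0.1, 1.0, 10.0])
gamma = np.array([0.5, 2.0, 12.0])
print(mmse_stsa_gain(xi, gamma).round(3))
```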

#15 Single acoustic-channel speech enhancement based on glottal correlation using a non-acoustic sensor

Authors: Rongqiang Hu ; David V. Anderson

This paper describes a single acoustic-channel speech enhancement algorithm that utilizes an auxiliary non-acoustic sensor. Unlike classical algorithms, which make use of knowledge from the acoustic signal alone, the glottal correlation (GCORR) algorithm takes advantage of non-acoustic throat sensors such as the general electromagnetic motion sensor (GEMS). The non-acoustic sensor provides a measure of the glottal excitation function that is relatively immune to background acoustic noise. Thus, inspired by the human speech production mechanism, the GCORR algorithm extracts the desired speech signal from the noisy acoustic mixture using the statistical correlation between the speech and its excitation. The algorithm leads to a significant reduction of wide-band noise, even when the SNR is very low. The improvement in speech quality is demonstrated in terms of an objective evaluation.

#16 In-vehicle based speech processing for hearing impaired subjects

Authors: Xianxian Zhang ; John H. L. Hansen ; Kathryn Arehart ; Jessica Rossi-Katz

It is important to enable hearing-impaired people to carry out everyday activities, such as driving a car, as normal-hearing people do. While there have been numerous studies in the field of speech enhancement for car noise environments, the majority have focused on noise reduction for normal-hearing individuals. In this paper, we present recent results in the development of more effective speech capture and enhancement processing for wireless voice interaction between subjects with hearing loss in real car environments. We first present a data collection experiment for a proposed FM wireless transmission scenario using a 5-channel microphone array in the car, followed by several alternative speech enhancement algorithms. After formulating 6 different processing methods, we evaluate their performance by SegSNR improvement using data recorded in a moving car. Among the 6 processing configurations, the combined fixed/adaptive beamforming (CFA-BF) achieves the highest SegSNR improvement, up to 2.65 dB.
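
SegSNR, the evaluation measure used here, can be computed as a frame-averaged, clamped SNR; a minimal sketch follows (the frame length and clamping limits are typical choices, not necessarily the authors'):

```python
import numpy as np

def segmental_snr(clean, processed, frame_len=256, floor_db=-10.0, ceil_db=35.0):
    """Mean segmental SNR in dB, with the usual per-frame clamping so that
    silent or near-perfect frames do not dominate the average."""
    snrs = []
    for i in range(len(clean) // frame_len):
        s = clean[i * frame_len:(i + 1) * frame_len]
        p = processed[i * frame_len:(i + 1) * frame_len]
        snr = 10.0 * np.log10(np.sum(s ** 2) / (np.sum((s - p) ** 2) + 1e-12) + 1e-12)
        snrs.append(np.clip(snr, floor_db, ceil_db))
    return float(np.mean(snrs))

rng = np.random.default_rng(1)
clean = rng.standard_normal(16000)
noisy = clean + 0.3 * rng.standard_normal(16000)
# SegSNR improvement = segmental_snr(clean, enhanced) - segmental_snr(clean, noisy)
print(round(segmental_snr(clean, noisy), 2))
```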

#17 Speech enhancement using adaptive time-domain segmentation

Authors: Sriram Srinivasan ; W. Bastiaan Kleijn

In this paper, we investigate the benefits of using an adaptive segmentation of the speech signal in speech enhancement. The adaptive segmentation scheme divides the signal into the longest segments within which stationarity is preserved, thus providing a good time-frequency resolution. The segmentation is performed with the help of an orthogonal library of local cosine bases using a computationally efficient tree-structured best-basis search. We show that such an adaptive segmentation results in improved speech enhancement compared to a fixed segmentation. The resulting enhanced speech is free from musical noise, without any additional smoothing.
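
A minimal sketch of a tree-structured best-basis segmentation, assuming a plain DCT per dyadic segment and an l1 sparsity cost; the paper uses smooth local cosine bases and its own cost function, so this illustrates only the search:

```python
import numpy as np
from scipy.fft import dct

def l1_cost(coeffs):
    """Additive sparsity cost: small when energy packs into few coefficients."""
    return float(np.abs(coeffs).sum())

def best_segmentation(signal, min_len=64):
    """Dyadic best-basis split: keep a segment whole if the cost of its
    DCT beats the summed cost of its two halves, otherwise recurse.
    Returns (segment lengths, total cost)."""
    cost_here = l1_cost(dct(signal, norm='ortho'))
    if len(signal) // 2 < min_len:
        return [len(signal)], cost_here
    half = len(signal) // 2
    left_seg, left_cost = best_segmentation(signal[:half], min_len)
    right_seg, right_cost = best_segmentation(signal[half:], min_len)
    if left_cost + right_cost < cost_here:
        return left_seg + right_seg, left_cost + right_cost
    return [len(signal)], cost_here

# A signal whose character changes halfway: expect a split near the change
# and longer segments where the signal stays stationary.
t = np.arange(1024) / 8000.0
x = np.sin(2 * np.pi * 440.0 * t)
x[512:] = np.sign(np.sin(2 * np.pi * 1300.0 * t[512:]))
print(best_segmentation(x)[0])
```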

#18 Harmonicity-based monaural speech dereverberation with time warping and F0 adaptive window

Authors: Tomohiro Nakatani ; Keisuke Kinoshita ; Masato Miyoshi ; Parham S. Zolfaghari

Although a number of dereverberation methods have been reported, dereverberation remains a challenging problem, especially when a single microphone is used. To address this problem, we previously proposed a harmonicity-based dereverberation method (HERB). HERB can blindly estimate the inverse filter of a room transfer function based on the harmonicity of speech signals and thereby dereverberate them. However, HERB relies on an imprecise assumption that limits dereverberation performance: when extracting the features of harmonic components, the fundamental frequency (F0) of a speech signal is assumed to be constant within a short time frame. In this paper, we combine HERB with time warping analysis and an F0 adaptive window to remove this bottleneck. This extension makes it possible to estimate harmonic components precisely even when their frequencies change rapidly. Experiments show that time warping analysis with an F0 adaptive window can effectively improve the dereverberation performance of HERB.

#19 Dereverberation of speech signals based on linear prediction

Authors: Marc Delcroix ; Takafumi Hikichi ; Masato Miyoshi

This paper proposes an algorithm for the blind dereverberation of speech signals based on two-channel linear prediction. Traditional dereverberation methods usually achieve good performance when the input signal is white noise. However, when dealing with colored signals generated by an autoregressive (AR) process, such as speech, the generating AR process is also deconvolved, causing excessive whitening of the signal. This paper proposes a blind dereverberation algorithm that recovers speech signals degraded by room reverberation. We overcome the whitening problem faced by traditional methods by estimating the generating AR process and applying it to the whitened signal. Simulation results show the great potential of the proposed method.
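
The whitening problem and its remedy can be illustrated in a single-channel toy setting: estimate the generating AR process, whiten, and re-apply the estimated AR envelope. The sketch below is that simplification, not the paper's two-channel algorithm:

```python
import numpy as np
from scipy.signal import lfilter

def estimate_ar(x, order):
    """Least-squares AR fit: x[n] ~ a[0]*x[n-1] + ... + a[order-1]*x[n-order]."""
    rows = np.array([x[i:i + order][::-1] for i in range(len(x) - order)])
    a, *_ = np.linalg.lstsq(rows, x[order:], rcond=None)
    return a

# Toy "speech": white noise colored by a known AR(2) process.
rng = np.random.default_rng(2)
true_a = np.array([1.6, -0.8])
speech = lfilter([1.0], np.r_[1.0, -true_a], rng.standard_normal(8000))

a_hat = estimate_ar(speech, order=2)                      # generating AR process
whitened = lfilter(np.r_[1.0, -a_hat], [1.0], speech)     # prediction residual
recolored = lfilter([1.0], np.r_[1.0, -a_hat], whitened)  # envelope restored
print(a_hat.round(2))  # close to [1.6, -0.8]
```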

#20 N-gram language modeling of Japanese using bunsetsu boundaries

Authors: Sungyup Chung ; Keikichi Hirose ; Nobuaki Minematsu

A new scheme of N-gram language modeling is proposed for Japanese, in which word N-grams are calculated separately for two cases: crossing and not crossing bunsetsu boundaries. Here, bunsetsu is a basic grammatical (and pronunciation) unit of Japanese. A similar scheme using accent-phrase boundaries instead of bunsetsu boundaries was already proposed by the authors with some success, but it suffered from a shortage of training data, because assigning accent-phrase boundaries requires a speech corpus. In contrast, bunsetsu boundaries can be detected automatically from written text with rather high accuracy using parsers. Experiments showed that perplexity reduction and word recognition rate improvement, especially for small training corpora, were possible by estimating bunsetsu boundaries from a history longer than the N-1 words of the N-gram model, and by selecting between the two model types (crossing and not crossing bunsetsu boundaries) according to that estimate.
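
A minimal sketch of keeping separate bigram statistics for word pairs inside a bunsetsu and across a bunsetsu boundary; the corpus, smoothing and vocabulary size are toy assumptions:

```python
from collections import defaultdict

# Toy corpus: each sentence is a list of bunsetsu; each bunsetsu, a list of words.
corpus = [
    [["kare", "wa"], ["hon", "o"], ["yonda"]],
    [["kanojo", "wa"], ["tegami", "o"], ["kaita"]],
]

within = defaultdict(lambda: defaultdict(int))    # bigrams inside a bunsetsu
crossing = defaultdict(lambda: defaultdict(int))  # bigrams across a boundary

for sentence in corpus:
    words = [w for bunsetsu in sentence for w in bunsetsu]
    # Indices of the first word after each bunsetsu boundary.
    boundaries, pos = set(), 0
    for bunsetsu in sentence[:-1]:
        pos += len(bunsetsu)
        boundaries.add(pos)
    for i in range(1, len(words)):
        table = crossing if i in boundaries else within
        table[words[i - 1]][words[i]] += 1

def bigram_prob(prev, word, crosses_boundary, alpha=0.5, vocab=1000):
    """Add-alpha smoothed probability from the appropriate table."""
    table = crossing if crosses_boundary else within
    total = sum(table[prev].values())
    return (table[prev][word] + alpha) / (total + alpha * vocab)

print(bigram_prob("wa", "hon", crosses_boundary=True))
```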

#21 Dynamic language modeling for broadcast news

Authors: Langzhou Chen ; Lori Lamel ; Jean-Luc Gauvain ; Gilles Adda

This paper describes some recent experiments on unsupervised language model adaptation for the transcription of broadcast news data. In previous work, a framework for automatically selecting adaptation data using information retrieval techniques was proposed. This work extends the method and presents experimental results with unsupervised language model adaptation. Three primary aspects are considered: (1) the performance of 5 widely used LM adaptation methods using the same adaptation data is compared; (2) the influence of the temporal distance between the training and test data epochs on the adaptation efficiency is assessed; and (3) show-based language model adaptation is compared with story-based language model adaptation. Experiments have been carried out for broadcast news transcription in English and Mandarin Chinese. With story-based MDI adaptation, a relative word error rate reduction of 4.7% was obtained in English and a relative character error rate reduction of 5.6% in Mandarin.
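
MDI adaptation, in its simplest unigram form, rescales a background model toward adaptation-data statistics; the sketch below illustrates only that form (the vocabulary and probabilities are invented):

```python
import numpy as np

def mdi_adapt(p_background, p_adapt, beta=0.5):
    """Unigram MDI-style adaptation: rescale the background model by
    (P_adapt(w) / P_background(w)) ** beta and renormalize."""
    scaled = p_background * (p_adapt / p_background) ** beta
    return scaled / scaled.sum()

vocab = ["election", "storm", "market", "goal"]
p_bg = np.array([0.25, 0.25, 0.25, 0.25])
p_ad = np.array([0.55, 0.05, 0.30, 0.10])  # e.g. from retrieved story-specific text
print(dict(zip(vocab, mdi_adapt(p_bg, p_ad).round(3))))
```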

#22 A unified framework for large vocabulary speech recognition of mutually unintelligible Chinese "regionalects"

Authors: Ren-Yuan Lyu ; Dau-Cheng Lyu ; Min-Siong Liang ; Min-Hong Wang ; Yuang-Chin Chiang ; Chun-Nan Hsu

In this paper, a new approach is proposed for recognizing speech in mutually unintelligible spoken Chinese regionalects, based on a unified three-layer framework and a one-stage search strategy. The framework includes (1) a unified acoustic model for all the considered regionalects; (2) a multiple-pronunciation lexicon constructed by both rule-based and data-driven approaches; and (3) a one-stage search network whose nodes represent Chinese characters with their multiple pronunciations. Unlike traditional approaches, the new approach avoids searching for intermediate, locally optimal syllable sequences or lattices. Instead, by using Chinese characters as the search nodes, it can find the globally optimal character sequence directly. This paper reports experiments on two Chinese regionalects, Taiwanese and Mandarin. Results show that the unified framework can efficiently deal with the multiple pronunciations of the spoken Chinese regionalects. The new approach achieves a character error reduction rate of 34.1% compared with the traditional two-stage scheme. Furthermore, the new approach is shown to be more robust when dealing with poorly uttered speech.

#23 The influence of target size and distance on the production of speech and gesture in multimodal referring expressions

Authors: Ielka van der Sluis ; Emiel Krahmer

In this paper we report on a production experiment on multimodal referring expressions. Subjects performed an object identification task in an interactive setting: 20 subjects participated and were asked to identify 30 countries on a world map on a wall. Subjects performed the task at two distances: close to the map (10 subjects) and 2.5 meters away (10 subjects). The assumption is that these conditions yield precise and imprecise pointing gestures, respectively. In addition, we varied the 'size' of the target objects (large or isolated objects versus small objects). This study resulted in a corpus of 600 multimodal referring expressions. A statistical analysis (ANOVA) revealed a main effect of distance (subjects adapt their language to the kind of pointing gesture) and a main effect of target (smaller objects are more difficult to describe than large or isolated objects).

#24 Dynamic time windows for multimodal input fusion

Authors: Anurag Kumar Gupta ; Tasos Anastasakos

Natural interaction in multimodal dialogue systems demands quick system response after the end of a user turn. The prediction of the end of user input at each multimodal dialog turn is complicated as users can interact through modalities in any order, and convey a variety of different messages to the system within the turn. Several multimodal interaction frameworks have used fixed-duration time windows to address this problem. We conducted a user study to evaluate the use of fixed-duration time windows and motivate further improvements. This paper describes a probabilistic method for computing an adaptive time window for multimodal input fusion. The goal is to adjust the time window dynamically depending on the user, task, and the number of multimodal inputs for each turn. Experimental results show that the resulting system has superior performance when compared to a system with fixed-duration time windows.
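
A minimal sketch of one way such an adaptive window could be computed, fitting a lognormal distribution to the observed gaps between inputs and waiting until a chosen coverage quantile; the class design and parameters are our assumptions, not the paper's method:

```python
import numpy as np
from scipy.stats import norm

class AdaptiveFusionWindow:
    """Choose the multimodal-fusion wait time from the observed distribution
    of gaps between successive inputs in a turn: wait just long enough to
    catch a follow-up input with probability `coverage`."""

    def __init__(self, coverage=0.95, initial_gap=1.5):
        self.coverage = coverage
        self.gaps = [initial_gap]  # seconds between consecutive inputs

    def observe_gap(self, seconds):
        self.gaps.append(seconds)

    def window(self):
        """Coverage quantile of a lognormal fitted to the observed gaps."""
        logs = np.log(np.asarray(self.gaps))
        return float(np.exp(norm.ppf(self.coverage,
                                     loc=logs.mean(), scale=logs.std() + 1e-6)))

w = AdaptiveFusionWindow()
for gap in (0.4, 0.7, 1.1, 0.5, 0.9):  # e.g. speech followed by pen input
    w.observe_gap(gap)
print(round(w.window(), 2))  # seconds to wait before closing the turn
```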

#25 MICoT: a tool for multimodal input data collection

Authors: Raymond H. Lee ; Anurag Kumar Gupta

In this paper, a multi-modal data collection tool called MICoT is described. We highlight the various design and implementation aspects that we consider important for MICoT. An example is given to illustrate the application of the tool to collecting data for our research on multi-modal dialogue systems.